GH-48701: [C++][Parquet] Add ALPpd encoding by prtkgaur · Pull Request #48345 · apache/arrow

prtkgaur · 2025-12-05T00:23:45Z

@Reviewer: The suggested order in which to look at the code while reviewing is outdated; I will update it shortly.

Rationale for this change

ALP significantly improves the compression ratio and decompression speed of float/double columns compared with other encoding/compression techniques.

Spec

Spec
This PR also contains a terse version of the spec in the file cpp/src/arrow/util/alp/ALP_Encoding_Specification_terse.md, which can go into Encodings.md.

Parquet Format PR

Dataset PR (parquet-testing)

apache/parquet-testing#100

What changes are included in this PR?

This PR introduces ALP (pseudo-decimal) encoding into the Arrow C++ code.
We also provide benchmarks and a dataset to demonstrate the effectiveness of the algorithm.

Adding it required the following classes.

Alp h/cc : Houses core logic for encoding and decoding.
Sampler h/cc : Houses logic to sample and select parameters for encoding.
AlpWrapper h/cc : Binds together Alp and Sampler classes.

Integration of the above code was done in

Encoder/Decoder cc, which exposes a wrapper to encode a buffer of data.
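To make the roles of the encoder and the sampler more concrete, here is a minimal sketch of the per-value round trip, assuming the usual ALP formulation in which a value `v` is stored as the integer `round(v * 10^e / 10^f)` and any value that does not decode back bit-for-bit is kept as an exception. The names `Pow10`, `AlpDecodeOne`, and `AlpTryEncodeOne` are hypothetical and are not taken from this PR; the real code operates on whole vectors and selects the exponent and factor by sampling.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <optional>

// Hypothetical helper: 10^n as a double (exactly representable for 0 <= n <= 22).
inline double Pow10(int n) {
  assert(n >= 0 && n <= 22);
  double p = 1.0;
  for (int i = 0; i < n; ++i) p *= 10.0;
  return p;
}

// Bitwise equality, so that -0.0 and +0.0 are treated as different values.
inline bool SameBits(double a, double b) {
  uint64_t x, y;
  std::memcpy(&x, &a, sizeof(x));
  std::memcpy(&y, &b, sizeof(y));
  return x == y;
}

// Decode: scale the stored integer back by 10^factor / 10^exponent.
inline double AlpDecodeOne(int64_t digits, int exponent, int factor) {
  return static_cast<double>(digits) * Pow10(factor) / Pow10(exponent);
}

// Encode: returns the integer if the value round-trips bit-for-bit,
// otherwise std::nullopt (the value would have to be stored as an exception).
inline std::optional<int64_t> AlpTryEncodeOne(double value, int exponent, int factor) {
  const double scaled = value * Pow10(exponent) / Pow10(factor);
  // Reject NaN/inf and anything that would not fit in int64_t.
  if (!std::isfinite(scaled) || std::fabs(scaled) >= 9.0e18) return std::nullopt;
  const int64_t digits = std::llround(scaled);
  if (!SameBits(AlpDecodeOne(digits, exponent, factor), value)) return std::nullopt;
  return digits;
}
```

In this sketch, `AlpTryEncodeOne(1.23, 2, 0)` yields `123`, while a value such as pi with the same parameters fails the round trip and would be recorded as an exception.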

Are these changes tested?

We have added unit tests for the code.
Benchmarks have also been added that cover a wide variety of floating-point values, from low precision to high precision.

Unit tests

alp_test.cc

Benchmark tests

encoding_benchmark.cc and encoding_alp_benchmark.cc

Are there any user-facing changes?

It's a new encoding, so the only impact is on query performance, which we claim will only get better.

DuckDB

We did look at DuckDB's ALP implementation while we were implementing ALP and would like to give that team the desired credit.

GitHub Issue: [ALP][Parquet] Add C++ implementation of ALPpd encoder/decoder #48701

github-actions · 2025-12-05T00:24:08Z

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}


alamb · 2025-12-08T14:04:07Z

I think the more standard place to put test data is in either arrow-testing or parquet-testing so it can be used across implementations

In this case I would recommend https://github.com/apache/parquet-testing

Makes sense. Thanks.
apache/parquet-testing#100

alamb · 2025-12-08T14:07:24Z

    DELTA_BYTE_ARRAY = 7,
    RLE_DICTIONARY = 8,
    BYTE_STREAM_SPLIT = 9,
+    ALP = 10,


https://github.com/apache/arrow/blob/main/cpp/src/parquet/parquet.thrift#L631 needs to be updated here and in parquet-format.

For parquet-format we have this PR: apache/parquet-format#557

alamb · 2025-12-08T14:07:48Z

Thanks @prtkgaur -- it is super exciting to see this movement.

Unfortunately, I am not familiar enough with the C/C++ codebase to give this a realistic review.

I started the CI checks on this PR and had some comments about the testing.

prtkgaur · 2025-12-08T22:16:49Z

Makes sense. Thanks.
apache/parquet-testing#100

prtkgaur · 2025-12-08T23:31:26Z

+    std::string tarball_path = std::string(__FILE__);
+    tarball_path = tarball_path.substr(0, tarball_path.find_last_of("/\\"));
+    tarball_path = tarball_path.substr(0, tarball_path.find_last_of("/\\"));
+    tarball_path += "/arrow/cpp/submodules/parquet-testing/data/floatingpoint_data.tar.gz";


@Reviewer the data sits in the parquet-testing submodule
apache/parquet-testing#100

prtkgaur · 2025-12-08T23:41:27Z


+  // Unsafe resize without initialization - use only when you will immediately
+  // overwrite the memory (e.g., before memcpy). Only safe for POD types.
+  void UnsafeResize(size_t n) {


Using this over resize gave us a performance improvement of around 2-3%.
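As an illustration of the idea (not the class from this PR), the following sketch shows one way to resize a buffer without zero-filling it. `PodBuffer` is a hypothetical name; the point is that `new T[n]` default-initializes trivial types, which leaves the memory untouched, whereas `std::vector::resize` and `std::make_unique<T[]>` value-initialize and therefore write zeros.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <type_traits>
#include <utility>

// Hypothetical buffer type, shown only to illustrate an uninitialized resize.
template <typename T>
class PodBuffer {
  static_assert(std::is_trivially_copyable_v<T> &&
                    std::is_trivially_default_constructible_v<T>,
                "PodBuffer is only safe for trivial (POD-like) types");

 public:
  // Resizes without initializing new elements. Elements beyond the old size
  // hold indeterminate values until the caller overwrites them.
  void UnsafeResize(std::size_t n) {
    if (n > capacity_) {
      // Default-initialization: no zero fill for trivial T.
      std::unique_ptr<T[]> grown(new T[n]);
      if (size_ > 0) std::memcpy(grown.get(), data_.get(), size_ * sizeof(T));
      data_ = std::move(grown);
      capacity_ = n;
    }
    size_ = n;
  }

  T* data() { return data_.get(); }
  std::size_t size() const { return size_; }

 private:
  std::unique_ptr<T[]> data_;
  std::size_t size_ = 0;
  std::size_t capacity_ = 0;
};
```

The caller is expected to overwrite the new range immediately, for example with `std::memcpy`, before reading from it.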

emkornfield · 2026-06-04T04:40:45Z

+class AlpEncodedVector {
+ public:
+  /// ALP-specific metadata (exponent, factor, num_exceptions)
+  AlpEncodedVectorInfo alp_info;


nit: class member names end in '_' (https://google.github.io/styleguide/cppguide.html#Variable_Names)

Agreed. Done

emkornfield · 2026-06-04T04:42:23Z

+}
+
+// ----------------------------------------------------------------------
+// AlpMetadataCache (LEGACY - not used with offset-based layout)


Can this be removed then?

Yes, AlpMetadataCache is legacy from a prior layout that was replaced by the offset-based interleaved format. I'll remove it.

emkornfield · 2026-06-04T04:43:56Z

+  /// The format supports arbitrary power-of-2 sizes via log_vector_size in the
+  /// page header, but this implementation currently only supports 1024.
+  /// Must fit in uint16_t (max 65535), so log_vector_size must be <= 15.
+  static constexpr int64_t kAlpVectorSize = 1024;


I thought you said C++ now supports arbitrary sizes?

kAlpVectorSize is just the default — it's the value used when the caller doesn't pass an explicit vector_size. The encode API accepts any power-of-2 up to 2^kMaxLogVectorSize, and the decode path reads the actual vector size from the page header, so it handles any size transparently.

I've updated the comment to make this clearer.
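A minimal sketch of the kind of check this implies, assuming the upper bound of 15 mentioned in the quoted comment. `LogVectorSize` and `kMaxLogVectorSizeSketch` are hypothetical names, and this sketch does not enforce any minimum size the real code may require.

```cpp
#include <cstdint>
#include <optional>

// Assumed upper bound, taken from the "<= 15" note in the quoted comment.
constexpr int kMaxLogVectorSizeSketch = 15;

// Returns log2(vector_size) if vector_size is a power of two no larger than
// 2^kMaxLogVectorSizeSketch, otherwise std::nullopt.
inline std::optional<uint8_t> LogVectorSize(int64_t vector_size) {
  if (vector_size <= 0) return std::nullopt;
  // A power of two has exactly one bit set.
  if ((vector_size & (vector_size - 1)) != 0) return std::nullopt;
  uint8_t log = 0;
  while ((int64_t{1} << log) < vector_size) ++log;
  if (log > kMaxLogVectorSizeSketch) return std::nullopt;
  return log;
}
```

With this helper, the default of 1024 maps to a `log_vector_size` of 10, and a caller-supplied size such as 1000 or 65536 is rejected.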

emkornfield · 2026-06-04T04:45:46Z

+///
+/// \tparam T the type of data to be compressed. Currently float and double.
+template <typename T>
+class AlpCompression : private AlpConstants {


nit: we shouldn't be using private inheritance

emkornfield · 2026-06-04T04:46:59Z

+  alp_info.Store({output_buffer.data() + offset, AlpEncodedVectorInfo::kStoredSize});
+  offset += AlpEncodedVectorInfo::kStoredSize;
+
+  // Store ForInfo (6/10 bytes)


5 or 9? maybe just remove the detail?

Suggested change

-  // Store ForInfo (6/10 bytes)
+  // Store ForInfo

Yes, will remove the detail.

emkornfield · 2026-06-04T04:47:34Z

+  std::memcpy(output_buffer.data() + offset, exceptions.data(), exception_size);
+  offset += exception_size;
+
+  ARROW_CHECK(offset == data_size)


same comment about safety.

emkornfield · 2026-06-04T04:47:55Z

+          {input_buffer.data() + input_offset, AlpEncodedVectorInfo::kStoredSize}));
+  input_offset += AlpEncodedVectorInfo::kStoredSize;
+
+  // Load ForInfo (6/10 bytes)


Suggested change

-  // Load ForInfo (6/10 bytes)
+  // Load ForInfo

emkornfield · 2026-06-04T04:52:45Z

+  const int64_t bit_packed_size =
+      AlpEncodedForVectorInfo<T>::GetBitPackedSize(num_elements, result.for_info.bit_width);
+
+  result.packed_values.resize(bit_packed_size);


is this in the hot path? did zeroing out values show up in any profiling? Maybe we can leave a TODO to re-examine?

For now I'm adding a TODO. (I remember seeing the zeroing show up, but I will have to rerun the setup to validate.)
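For context on what the resized `packed_values` buffer holds, here is a rough sketch of frame-of-reference (FOR) sizing. It assumes the frame of reference is the minimum value, the bit width is the number of bits needed for the largest delta, and the packed size is `ceil(n * bit_width / 8)` bytes. `ForSummary` and `SummarizeFor` are hypothetical names, and the actual `GetBitPackedSize` in this PR may round differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct ForSummary {
  int64_t frame_of_reference;  // minimum of the input
  uint8_t bit_width;           // bits needed for the largest delta
  int64_t bit_packed_size;     // bytes, assuming tight packing
};

inline ForSummary SummarizeFor(const std::vector<int64_t>& values) {
  assert(!values.empty());
  const auto [min_it, max_it] = std::minmax_element(values.begin(), values.end());
  // Unsigned subtraction avoids signed overflow when the range is very wide.
  uint64_t max_delta = static_cast<uint64_t>(*max_it) - static_cast<uint64_t>(*min_it);
  uint8_t bit_width = 0;
  while (max_delta != 0) {
    ++bit_width;
    max_delta >>= 1;
  }
  const int64_t total_bits = static_cast<int64_t>(values.size()) * bit_width;
  return {*min_it, bit_width, (total_bits + 7) / 8};
}
```

Because every byte of the packed buffer is written by the decoder before it is read, the zero fill performed by `resize` is redundant work, which is why it is worth re-examining on the hot path.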

emkornfield · 2026-06-04T04:57:09Z

+  ptr += sizeof(result.frame_of_reference);
+
+  // bit_width: 1 byte
+  result.bit_width = *ptr;


validate bit_width is <= target size.
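One way to read this request, sketched under the assumption that "target size" means the width in bits of the integer type matching `T` (32 for float, 64 for double). `IsValidBitWidth` is a hypothetical name, not a function from this PR.

```cpp
#include <cstdint>
#include <type_traits>

// Returns true if bit_width cannot exceed the width of the integer type
// used to hold encoded values of type T.
template <typename T>
constexpr bool IsValidBitWidth(uint8_t bit_width) {
  static_assert(std::is_same_v<T, float> || std::is_same_v<T, double>,
                "only float and double are considered in this sketch");
  return bit_width <= sizeof(T) * 8;
}
```

A loader could call this right after reading the byte and return an error status instead of continuing with a corrupt width.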

emkornfield · 2026-06-04T05:02:05Z

+  /// \param[in] input the compressed buffer
+  /// \param[in] input_size the size of the compressed data
+  /// \return the AlpHeader, or an error if the buffer is too small
+  static Result<AlpHeader> LoadHeader(const char* input, int64_t input_size);


nit: consistency on char/uint8_t

emkornfield · 2026-06-04T05:03:14Z

+// AlpCodec implementation
+
+template <typename T>
+auto AlpCodec<T>::LoadHeader(const char* input, int64_t input_size)


is auto needed here? It can just return AlpHeader directly?

It was a large return type, Result<typename AlpCodec<T>::AlpHeader>; auto looked more readable to me.

emkornfield · 2026-06-04T05:03:54Z

+  header.compression_mode = static_cast<uint8_t>(input[0]);
+  header.integer_encoding = static_cast<uint8_t>(input[1]);
+  header.log_vector_size = static_cast<uint8_t>(input[2]);
+  std::memcpy(&header.num_elements, input + 3, sizeof(header.num_elements));


SafeLoad ...
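The follow-up commit message in this thread mentions util::SafeLoadAs. As a standalone stand-in for that kind of helper, the sketch below wraps the `memcpy` in a typed function so that call sites cannot get the size or destination wrong. `LoadUnaligned` is a hypothetical name, and it reads in host byte order.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Reads a T from a possibly unaligned byte pointer without undefined behavior.
template <typename T>
inline T LoadUnaligned(const uint8_t* src) {
  static_assert(std::is_trivially_copyable_v<T>,
                "only trivially copyable types can be loaded bytewise");
  T value;
  std::memcpy(&value, src, sizeof(T));
  return value;
}
```

For the header above, the `num_elements` field at offset 3 would then be read as `LoadUnaligned<int32_t>(input + 3)`.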

emkornfield · 2026-06-04T05:06:40Z

+  header.integer_encoding = static_cast<uint8_t>(input[1]);
+  header.log_vector_size = static_cast<uint8_t>(input[2]);
+  std::memcpy(&header.num_elements, input + 3, sizeof(header.num_elements));
+  return header;


validate log_vector_size, compression_mode and integer encoding here? Also, num_elements > 0

So like CHECK? DCHECK? Or just a trivial if condition?

For now, going with an if condition that returns an invalid status if the checks fail.
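A rough sketch of what "if conditions that return an error" could look like, using `std::optional` in place of Arrow's `Result` so the example stays self-contained. The layout (three one-byte fields followed by a 4-byte `num_elements`) follows the quoted diff; `HeaderSketch`, `LoadHeaderSketch`, the value 0 for the bit-pack integer encoding, and the limit of 15 are assumptions for illustration.

```cpp
#include <cstdint>
#include <cstring>
#include <optional>

struct HeaderSketch {
  uint8_t compression_mode;
  uint8_t integer_encoding;
  uint8_t log_vector_size;
  int32_t num_elements;
};

// 3 one-byte fields + a 4-byte element count, per the quoted diff.
constexpr int64_t kHeaderSketchSize = 7;

// Returns the header, or std::nullopt if the buffer is too small or any field
// is out of range. A real implementation would return a descriptive Status.
inline std::optional<HeaderSketch> LoadHeaderSketch(const uint8_t* input,
                                                    int64_t input_size) {
  if (input == nullptr || input_size < kHeaderSketchSize) return std::nullopt;
  HeaderSketch h;
  h.compression_mode = input[0];
  h.integer_encoding = input[1];
  h.log_vector_size = input[2];
  std::memcpy(&h.num_elements, input + 3, sizeof(h.num_elements));
  if (h.compression_mode != 0) return std::nullopt;  // only kAlp (= 0) is known
  if (h.integer_encoding != 0) return std::nullopt;  // assumed value for FOR + bit-pack
  if (h.log_vector_size > 15) return std::nullopt;   // assumed maximum
  if (h.num_elements < 0) return std::nullopt;
  return h;
}
```

Unlike CHECK or DCHECK, this keeps malformed input from aborting the process, which matters because page headers come from untrusted files.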

emkornfield · 2026-06-04T05:09:22Z

+template <typename T>
+auto AlpCodec<T>::CreateSamplingPreset(const T* input, int64_t input_size)
+    -> AlpSamplerResult {
+  ARROW_CHECK(input_size >= 0 && input_size % sizeof(T) == 0)


should this just take a span of T, or alternatively should input_size be the number of elements to begin with?

span would be nice for type safety but the parquet encoder holds data as raw bytes in a BufferBuilder, so it would still need a reinterpret_cast to construct the span.

I went with the other suggested approach. It's consistent with the decode path which already takes element count.

emkornfield · 2026-06-04T05:11:19Z

+                                     const AlpSamplerResult& preset,
+                                     int32_t vector_size,
+                                     char* output, int64_t* output_size) {
+  ARROW_CHECK(input_size >= 0 && input_size % sizeof(T) == 0)


same question, can Span be used instead?

went with num_elements

emkornfield · 2026-06-04T05:12:59Z

+  header.compression_mode = static_cast<uint8_t>(AlpMode::kAlp);
+  header.integer_encoding = static_cast<uint8_t>(AlpIntegerEncoding::kForBitPack);
+  header.log_vector_size = AlpHeader::Log2(vector_size);
+  header.num_elements = static_cast<int32_t>(element_count);


Passing element_count directly could avoid the downcast? If we are allowing the freedom of int64, we probably want to check that this is a safe truncation.

Makes sense.

Added a check for num_elements <= INT32_MAX before the truncating cast to int32_t for the header field. The public API takes int64_t for consistency with Arrow conventions, but the on-disk header stores it as int32_t (matching Parquet page sizes), so we validate at the boundary.
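The boundary check described here can be sketched as follows. `ToHeaderElementCount` is a hypothetical name, and `std::optional` stands in for whatever status type the real code returns.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Narrows an int64_t element count to the int32_t stored in the header,
// refusing values that would be truncated or are negative.
inline std::optional<int32_t> ToHeaderElementCount(int64_t num_elements) {
  if (num_elements < 0 ||
      num_elements > std::numeric_limits<int32_t>::max()) {
    return std::nullopt;
  }
  return static_cast<int32_t>(num_elements);
}
```

Doing the check once, where the public int64_t API meets the on-disk int32_t field, keeps the cast itself trivially safe.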

emkornfield · 2026-06-04T05:13:25Z

+  encoded_header[0] = header.compression_mode;
+  encoded_header[1] = header.integer_encoding;
+  encoded_header[2] = header.log_vector_size;
+  std::memcpy(encoded_header + 3, &header.num_elements, sizeof(header.num_elements));


emkornfield · 2026-06-04T05:14:17Z

+Status AlpCodec<T>::Decode(int32_t num_elements, const char* input, int64_t input_size,
+                             TargetType* output) {
+  ARROW_ASSIGN_OR_RAISE(const AlpHeader header, LoadHeader(input, input_size));
+  if (header.log_vector_size > AlpConstants::kMaxLogVectorSize) {


also, less than? Move this into the LoadHeader function?

Got it.
LoadHeader is probably the right place for these checks.

emkornfield · 2026-06-04T05:15:09Z

+  const char* body = input + AlpHeader::kSize;
+  const int64_t body_size = input_size - static_cast<int64_t>(AlpHeader::kSize);
+
+  if (header.GetCompressionMode() != AlpMode::kAlp) {


same comment, consider doing all validation in one place.

emkornfield

Still reviewing but wanted to flush comments for what I have so far.

- Replace std::memcpy with util::SafeLoadAs/SafeStore for all single-value loads/stores from uint8_t* in alp.cc and alp_codec.cc
- Convert AlpEncodedVectorInfo and AlpEncodedForVectorInfo from struct to class per Google C++ style guide (private data with trailing underscore, public getters/setters)
- Add bit_width validation in AlpEncodedForVectorInfo::Load
- Fix incorrect comment "(6/10 bytes)" → remove byte count detail
- Add safety comments on ARROW_CHECK assertions in Store paths
- Add TODO for resize() zero-initialization on decode hot path
- Make AlpMode::kAlp explicitly = 0

…with getters Per Google C++ style guide, class data members use trailing underscores. Convert AlpEncodedVector<T> and AlpEncodedVectorView<T> (struct→class) with private members, const getters, mutable getters for vectors, and setters. Updates all ~95 call sites across alp.cc, alp_codec.cc, and alp_test.cc.

AlpMetadataCache was designed for an older grouped metadata layout that has been superseded by the offset-based interleaved format. The codec reads offsets and metadata inline, making this cache unnecessary. Also removes GetNumElements() which is now redundant with the num_elements() getter added in the prior commit.

Replace input_size (bytes) with num_elements across all AlpCodec encode APIs. This removes the sizeof(T) divisibility precondition, simplifies callers, and makes the encode path consistent with the decode path which already takes element count. Also consolidates validation: encode checks are in EncodeWithPreset, decode checks are in LoadHeader. Adds INT32_MAX bounds check before header truncation, and uses SafeStore for all header field writes.

Aligns with Arrow buffer conventions (Buffer::data() returns uint8_t*). This eliminates reinterpret_casts at parquet encoder/decoder call sites. Also updates kAlpVectorSize comment to reflect that arbitrary power-of-2 vector sizes are supported (1024 is just the default, not a limitation).

github-actions Bot added Component: Parquet Component: C++ awaiting review Awaiting review labels Dec 5, 2025

prtkgaur force-pushed the gh540-alp-pseudoDecimal-encoding branch 3 times, most recently from 1b78a5c to d563ce0 Compare December 7, 2025 15:46

alamb reviewed Dec 8, 2025

alamb mentioned this pull request Dec 8, 2025

[Parquet] Prototype ALP encoding apache/arrow-rs#8748

Open

prtkgaur changed the title ~~[Gh540] Add ALPpd encoding to parquet~~ [Gh539] Add ALPpd encoding to parquet Dec 8, 2025

prtkgaur commented Dec 8, 2025

prtkgaur changed the title ~~[Gh539] Add ALPpd encoding to parquet~~ [Gh539][Encoding] Add ALPpd encoding to parquet Dec 8, 2025

prtkgaur changed the title ~~[Gh539][Encoding] Add ALPpd encoding to parquet~~ [Gh-539][Encoding] Add ALPpd encoding to parquet Dec 8, 2025

sfc-gh-pgaur and others added 14 commits December 8, 2025 23:47

Add alp code

0d442fd

Co-authored-by: Dhirhan Kanesalingam <dhirhan17@gmail.com>

Integrate ALP with arrow

06d1e19

Add alp benchmark

a98c594

Add datasets for alp benchmarking

c297f97

Update cmake file

ab928e8

Move hpp files to h

6a95a59

Update flow digram and layout digram to use ASCII and not unicode cha…

865e46a

…racters

Rename cpp files to cc

cb6d0b6

Update documentation to align with arrow's doxygen style

496e23b

Adapt methods and variable names to arrow style

8803b52

Also ensure that no line exceeds 90 characters

Update the tests to adhere to arrow style code

31e94ec

Update callers

46c0ecc

Fuse FOR and decode loop

a70b08f

Reduce memory allocation in the decompress call

ccbb1dd

emkornfield reviewed Jun 4, 2026

emkornfield requested changes Jun 4, 2026

sfc-gh-pgaur added 9 commits June 8, 2026 03:17

Remove private inheritance from AlpCompression and AlpInlines

c96fef1

Use explicit AlpConstants:: qualification instead of private inheritance per reviewer feedback. Private inheritance is discouraged as it obscures the relationship between classes.

Replace trailing return type with explicit Result<AlpHeader> in LoadH…

2b245ca

…eader

Use SafeLoadAs for all header fields in LoadHeader

f151349

Validate header fields in LoadHeader

cb10222

Reject invalid compression_mode, integer_encoding, log_vector_size, and negative num_elements when loading the ALP page header.

prtkgaur force-pushed the gh540-alp-pseudoDecimal-encoding branch from 894279f to 43d534d Compare June 8, 2026 19:55
